(57) Abstract

An associative text search and retrieval system (30) uses one or more front end processors (56-58) to interact with a network 
(62) having one or more user terminals (64-66) connected thereto to allow a user to provide information to the system (30) and receive 
information from the system. The system (30) also includes storage (46-49) for a plurality of text documents, and at least one processor 
(42-44), coupled to the front end processors (56-58) and the document storage (46-49). Each of the processors (42-44) is provided access to 
thesaurus dictionaries (52-54). The processor(s) (32-35) search the text documents according to a search request provided by the user and 
provide to the front end processor (56-58) a predetermined number of retrieved documents containing at least one term of the search request. 
The retrieved documents have higher ranks than documents not provided to the front end processor (56-58). The ranks are calculated using 
a formula that varies according to the square of the frequency in each of the text documents of each of the search terms. 
ASSOCIATIVE TEXT SEARCH AND RETRIEVAL SYSTEM
Technical Field 

This invention relates to the field of searching and retrieving text documents and 
more particularly to the field of using one or more computers to search a plurality of 
text documents in order to retrieve documents having particular terms and phrases.

Background Art 

It is known that a large collection of text documents can be searched for 
particular keywords or phrases. A user can provide a single word or phrase or multiple 
words or phrases connected by Boolean connectors such as "AND" or "OR". However, 
in many cases, a user must be fairly sophisticated to perform searches of
sufficient complexity to retrieve the exact class of documents that the user
desires without having to perform an excessive number of searches.

Associative retrieval, a technique for information retrieval developed in the 
1960s by Gerard Salton, addresses some of the shortcomings of Boolean searching.
Automatic Text Processing (published by Addison Wesley, New York, New York, 1988,
and written by Gerard Salton) provides a description of associative retrieval searching.
The basic formula used in associative retrieval involves calculating a term weight for
each term within a search request, and scoring documents in a collection based on the
sum of the weights for the search request terms that occur within the document. The
two basic weighting factors are known as the term frequency - tf - and the inverse
document frequency - idf.

The term frequency is defined as the number of times the term occurs within a 
given document. Hence, the term frequency must be calculated for each document
within the collection. 

The inverse document frequency is defined as the inverse of the number of 
documents in the entire collection which contain the term. Therefore, if df documents
within a collection of N documents contain a given term, the idf would be 1/df. The idf
can be normalized with respect to the number of documents by setting it to log(N/df).
The idf is calculated for each search request term, but is constant for the collection and
does not vary by document. The score for a given document is calculated by summing
the products of the tf and the idf for each search request term contained in the
document.
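
The background scoring just described can be sketched as follows. This is a minimal illustration of the conventional tf and log(N/df) weighting only, not the ranking formula of the invention; the function names and the sample documents are assumptions introduced for the example.

```python
import math
from collections import Counter


def idf(term, documents):
    """Normalized inverse document frequency log(N/df); 0.0 if no document has the term."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0


def score(query_terms, doc, documents):
    """Sum of tf * idf over the search request terms that occur in doc."""
    tf = Counter(doc)  # term frequency: occurrences of each term in this document
    return sum(tf[t] * idf(t, documents) for t in query_terms if tf[t])


# Hypothetical three-document collection, each document a list of terms.
documents = [
    ["tax", "court", "tax"],
    ["court", "opinion"],
    ["defamation", "court"],
]
scores = [score(["tax", "court"], doc, documents) for doc in documents]
```

In this sample, "court" occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing; only the two occurrences of "tax" raise the score of the first document.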

However, there are many aspects of Associative Retrieval as described by Salton 
which render it impractical or unwieldy for large scale commercial use for searching 
and retrieving documents in large databases. Furthermore, most of the work done in the
area of Associative Retrieval has failed to adequately address aspects relating to human 
interaction and feedback. It is desirable, therefore, to provide an associative text search 
and retrieval system that overcomes the deficiencies of known systems. 

Summary of the Invention

According to the present invention, a user provides a search description 
containing one or more search terms to an associative text search and retrieval system 
that searches a document database to retrieve documents containing at least one of the 
supplied search terms and then ranks each document using a ranking formula that varies
according to the square of the term frequency of each of the search terms in the
document. The ranking formula can also vary according to the inverse document
frequency of each search term. The formula can also use a maximum term frequency to
estimate the size of a document and the maximum document frequency to estimate the
number of documents in a collection of documents, thus reducing the amount of
processing needed to determine document size and the number of documents in a
collection. The user can provide mandatory terms which cause the search to return only
documents that contain those terms. 

The system can employ a thesaurus to provide both synonyms and morphological 
variations of words. Phrases in the search description are detected using a table with a
bitmap indicating possible positions of a word in a phrase and by using a tree having
nodes corresponding to IDs associated with words in a phrase, the nodes being
connected according to the order that the words can appear in a phrase. The system
optimizes the search by distinguishing between noise words, which are not provided in
an index for the documents, and frequently used terms, which are provided in the index
but which are not used in the search.

The system can provide display options for the documents that are retrieved by
the search, including displaying a window of text that contains the greatest number and
diversity of search terms and mandatory terms. The system can also display a screen
indicating which search terms are in which retrieved documents and can display a
screen that indicates the importance of each term, which varies according to the inverse
of the document frequency of each term. The documents can be sorted according to
rank or according to a predetermined default method, such as reverse chronological 
order. 

The system can include a plurality of interconnected processors and appropriate
data therefor wherein some of the processors perform searches and others of the
processors merge the search data and interact with the user.

Brief Description of Drawings 

FIG. 1 is a schematic view of a document searching system according to the
invention.

FIG. 2 illustrates data stored in a physical document collection.

FIG. 3 illustrates data stored in an index for a physical document collection.

FIG. 4 illustrates a logical document collection comprised of a plurality of
physical document collections.

FIG. 5 illustrates a logical document collection comprised of a plurality of
subsets of physical document collections.

FIG. 6 is a screen illustrating entry of a search description.

FIG. 7 is a screen illustrating entry of search options.

FIG. 8 is a screen illustrating entry of mandatory terms.

FIG. 9 is a screen illustrating entry of restrictions.

FIG. 10 is a screen illustrating displaying of restrictions.

FIG. 11 is a screen illustrating entry of both mandatory terms and restrictions.

FIG. 12 is a screen illustrating a thesaurus function.

FIG. 13 is a screen illustrating selection of synonyms and/or morphological
variations of a term using the thesaurus function.

FIG. 14 is a screen illustrating mandatory terms, restrictions, and thesaurus
entries.

FIG. 15 is a screen illustrating options for viewing documents retrieved after a
search.

FIG. 16 is a screen illustrating a "why" function.

FIG. 17 is a screen illustrating a "where" function.

FIG. 18 is a flowchart illustrating overall operation of the system according to
the invention.

FIG. 19 shows a table used to detect phrases.

FIG. 20 shows a tree data structure used to detect phrases.

FIG. 21 shows a plurality of contiguous words from a search description.

FIG. 22 is a flowchart illustrating operation of a search algorithm.

Best Mode for Carrying Out the Invention

Referring to FIG. 1, a document search and retrieval system 30 allows a user to 
search a subset of a plurality of documents for particular key words or phrases and 
retrieves, for the user to view, documents that correspond to the search request. The 
system 30 comprises a plurality of Search and Retrieval (SR) computers 32-35 
connected via a high speed interconnection 38 to a plurality of Session Administrator 
(SA) computers 42-44. Each of the SR's 32-35 is connected to one or more document
collections 46-49, each containing text for a plurality of documents, indexes therefor,
and other ancillary data. More than one SR can be provided access to a single
document collection. Also, a single SR can be provided access to more than one
document collection. The SR's 32-35 can be implemented using a variety of
commercially available computers known to one of ordinary skill in the art, such as
Model EX100 manufactured by Hitachi Data Systems of Santa Clara, California.

Each of the SA's 42-44 is provided access to data representing phrase and 
thesaurus dictionaries 52-54. The SA's 42-44 can also be implemented using a variety
of commercially available computers, such as Models 5990 and 5995 manufactured by
Amdahl Corporation of Sunnyvale, California. The interconnection 38 between the SR's
and the SA's can be any one of a number of two-way high-speed computer data
interconnections known to one of ordinary skill in the art, such as the Model 7200-DX
manufactured by Network Systems Corporation of Minneapolis, Minnesota.

Each of the SA's 42-44 is connected to one of a plurality of front end processors 
56-58. The front end processors 56-58 provide a connection of the system 30 to one or
more commonly available networks 62 for accessing digital data, such as an X.25
network, long distance telephone lines, and SprintNet. Connected to the network 62 is a
plurality of terminals 64-66 which provide user access to the system 30. The terminals
64-66 can be dumb terminals that simply process and display data inputs and outputs, or
can be one of a variety of readily available stand-alone computers, such as an IBM or
IBM-compatible Personal Computer. The front end processors 56-58 can be
implemented by a variety of commercially available devices, such as Models 4745 and
4705 manufactured by the Amdahl Corporation of Sunnyvale, California. Note that the
number of components shown in FIG. 1 is for illustrative purposes only and that the
system 30 described herein can have any number of SA's, SR's, front end processors,
etc. Also, the distribution of processing described herein may be modified and may in
fact be performed on a single computer without departing from the spirit and scope of
the invention.

A user wishing to access the system 30 via one of the terminals 64-66 will use 
the network 62 to establish a connection, by means known to one of ordinary skill in 
the art, to one of the front end processors 56-58. The front end processors 56-58 handle
communication with the user terminals 64-66 by providing output data for display by
the terminals 64-66 and by processing terminal keyboard inputs entered by the user.
The data output by the front end processors 56-58 includes text and screen commands.
The front end processors 56-58 support screen control commands, such as the
commonly known VT100 commands, which provide screen functionality to the
terminals 64-66 such as clearing the screen and moving the cursor insertion point. The
front end processors 56-58 can handle other known types of terminals and/or
stand-alone computers by providing appropriate commands.

Each of the front end processors 56-58 communicates bidirectionally, by means
known to one of ordinary skill in the art, with the particular one of the SA's 42-44
connected thereto. It is also possible to configure the system, in a manner known to
one of ordinary skill in the art, such that one or more of the front end processors can
communicate with more than one of the SA's 42-44. The front end processors 56-58
can be configured to "load balance" the SA's 42-44 in response to data flow patterns.
The concept of load balancing is known to one of ordinary skill in the art.

Each of the SA's 42-44 contains an application program, described in more
detail hereinafter, that processes search requests input by a user at one of the terminals
64-66 and passes the search request information on to one or more of the SR's 32-35,
which perform the search and return the results, including the text of the documents, to
the SA's 42-44. The SA's 42-44 provide the user with text documents corresponding to
the search results via the terminals 64-66. For a particular user session (i.e., a single
user accessing the system via one of the terminals 64-66), a single one of the SA's
42-44 will interact with a user through an appropriate one of the front end processors
56-58.



Referring to FIG. 2, data 70 stored in each of the physical document collections 
46-49 consists of document text 72, an index 74, and ancillary document information 
76. The data 70 can be located in one or more files of a computer hard disk storage 
device. The document text 72 portion of the data 70 is comprised of character data
representing text (such as ASCII or EBCDIC character data) for a plurality of
documents. Each of the documents that are part of the document text 72 can be
accessed individually. The index 74 contains a list of terms (words and phrases) that
are present in all of the documents of the document text 72 along with the locations in
the documents of those terms. The ancillary document information 76, described in
more detail hereinafter, contains other information about the documents, such as the
dates associated with the documents, the source of the documents, etc.

Referring to FIG. 3, the index 74 for a document collection comprises a plurality 
of entries that relate particular terms (term1 - termn) to a plurality of locations (loc A -
loc I). The table shown on the left-hand portion of FIG. 3 relates each term to a pair of
pointers such that term1 is related to pointers P1 and P2, term2 is related to pointers P3
and P4, and termn is related to pointers Pr and Ps. The right-hand portion of FIG. 3
represents a list of all of the locations for all of the terms in the physical document
collection. The pointers associated with each term point to the first and last locations in
the list in order to correlate the terms in the text of the documents of the physical
collection with the locations of the terms. For example, FIG. 3 shows term1 being
located at locations loc A (pointed to by P1) through loc C (pointed to by P2) in the
list. All of the locations in the list between the entry for loc A and the entry for loc C
indicate separate locations for term1 in the document collection. Words and phrases
which are so common as to be of little value in document searching, such as the word
"or", are deemed "noise words" and are not included in the index. A list of noise words
for each physical document collection is stored with the ancillary document information 
76. 
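
The index layout of FIG. 3 can be sketched as a term table holding a pair of pointers into one shared list of locations. This is only an illustration under stated assumptions: the `Location` fields (document number and word offset), the class name `Index`, and the sample data are hypothetical, since the text does not say how a location is encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    doc_id: int  # assumed: number of the document containing the term
    offset: int  # assumed: word position of the term within that document


class Index:
    """Term table (left side of FIG. 3) plus one shared location list (right side)."""

    def __init__(self, postings, noise_words):
        self.locations = []  # all locations for all indexed terms
        self.pointers = {}   # term -> (first, last) positions in self.locations
        for term in sorted(postings):
            if term in noise_words or not postings[term]:
                continue  # noise words are never placed in the index
            first = len(self.locations)
            self.locations.extend(postings[term])
            self.pointers[term] = (first, len(self.locations) - 1)

    def lookup(self, term):
        """Return every location between the term's first and last pointers."""
        if term not in self.pointers:
            return []
        first, last = self.pointers[term]
        return self.locations[first:last + 1]


index = Index(
    {
        "tax": [Location(0, 4), Location(2, 1)],
        "court": [Location(1, 7)],
        "of": [Location(0, 1)],
    },
    noise_words={"of"},
)
```

Here `index.lookup("tax")` walks the list from the first pointer to the last pointer, while `index.lookup("of")` finds nothing because the noise word was left out when the index was built.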

The SR's 32-34 search documents in a physical collection for particular terms by 
accessing the index 74. Terms in the search request are matched with terms in the index
74 in order to find specific documents in the document text 72 which contain the terms 
in the search request. Plural terms are depluralized and stored in their singular form. 
Terms that are submitted for a search are also depluralized. Depluralization is known in 
the art and is described in Program, Vol. 14, no. 3, pp. 130-137, July 1980, which is
incorporated by reference herein.

A user does not typically search all of the documents of the system 30, but 
rather, chooses a subset of the documents to be searched prior to conducting the search. 
The subset, deemed a logical document collection, corresponds to a particular class or 
group of documents having one or more common attributes. For example, all of the
written opinions of the U.S. Supreme Court can be a logical collection. As described in
more detail below, a logical collection can correspond to a physical collection, can be a 
subset of a physical collection, can be made up of a plurality of physical collections, or 
can be made up of a plurality of subsets of physical collections. 

Referring to FIG. 4, a logical document collection 80 of written opinions of the 
U.S. Supreme Court is comprised of four physical collections 82-85: opinions from 
1700-1900, opinions from 1901-1960, opinions from 1961-1992, and opinions from 
1993. As described above, each of the physical collections 82-85 is comprised of a text 
section and an index. If a user is searching all U.S. Supreme Court cases, then the
particular SA that is interacting with the user will use a lookup table to determine which
physical document collections correspond to the selected logical document collection.
The lookup table correlates logical document collections with physical document 
collections. The SA then sends commands to the appropriate SR(s) to perform the 
search on the appropriate physical document collection(s) and return the results. 

Referring to FIG. 5, a logical document collection 90 represents written opinions
of the U.S. Supreme Court that deal with tax issues. The logical document collection
90 does not physically exist by itself in a single location. Rather, the logical document
collection 90 is comprised of a plurality of subsets 92-95 of the physical collections
82-85. The subset 92 corresponds to tax issue cases in the U.S. Supreme Court written
opinions contained in the physical collection 82. Similarly, the subsets 93-95
respectively represent portions of the physical collections 83-85 that deal with tax issues.

If a user selects the logical collection 90 for searching, the SA that is interacting 
with the user determines the physical collections 82-85 for searching, as discussed
above, and also provides instructions to the SR(s) for constructing a filter that will
ensure that only documents that are part of the logical collection will be retrieved. A
filter is a bitmap or a list of documents that indicates to an SR which documents are to
be searched and which documents are to be ignored. The SR(s) use the filters in order
to return only results that correspond to documents that are part of the logical
collection. Filters can be constructed by the SR's using tables or by examining the
ancillary document information 76. For the logical collection 90, filters will cause the 
search to only return documents corresponding to the subsets 92-95 of the physical 
collections 82-85. 
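
The lookup table and filter described above can be sketched as follows. The collection names, the per-document topic sets standing in for the ancillary document information, and the function names are all assumptions made for illustration; the text only states that a filter is a bitmap or list and that it can be built from tables or ancillary data.

```python
# Hypothetical lookup table correlating logical collections with physical ones.
LOGICAL_TO_PHYSICAL = {
    "supreme-court-all": ["sc-1700-1900", "sc-1901-1960", "sc-1961-1992", "sc-1993"],
    "supreme-court-tax": ["sc-1700-1900", "sc-1901-1960", "sc-1961-1992", "sc-1993"],
}


def build_filter(doc_topics, wanted_topic):
    """Build a bitmap with one entry per document: True means 'search this document'."""
    return [wanted_topic in topics for topics in doc_topics]


def apply_filter(matching_doc_ids, bitmap):
    """Keep only the matching documents that the bitmap marks as searchable."""
    return [doc_id for doc_id in matching_doc_ids if bitmap[doc_id]]


# Assumed ancillary information for one physical collection of four documents.
doc_topics = [{"tax"}, {"criminal"}, {"tax", "commerce"}, set()]
tax_filter = build_filter(doc_topics, "tax")
filtered = apply_filter([0, 1, 2, 3], tax_filter)
```

Both logical collections map to the same four physical collections; only the second one needs a filter, which is why the subset collection costs a bitmap rather than a separate copy of the documents.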

Referring to FIG. 6, a user screen 100 illustrates input of a search request. The 
screen 100 is displayed on a user terminal. A prompt 102 on the screen 100 instructs 
the user to enter a search description. The user types a search description at the 
terminal keyboard into an area 104 of the screen 100. The description can include 
quotes around groups of words to provide phrases. The quotes indicate that the phrases
should be searched as a single term rather than searching the words of the phrase
separately. Individual instances of the words will not match the search. A user presses
the <ENTER> key at the terminal to indicate that he has completed entering the search 
description. 

The SA's 42-44 process the search description by instructing the SR(s) to
separately search the logical collection chosen by the user for each word in the
description (with some exceptions, noted later). Each term in the search
description that is not a noise word is searched separately by the SR(s). A document
containing at least one of the terms matches the search. Each matched document is then
rated, as described in more detail below, and retrieved for the user to view.

Referring to FIG. 7, a screen 110 is provided in response to the user entering a 
search description and pressing <ENTER>. The screen includes a search description 
112a and a plurality of options 114-118. The first option 114 allows the user to
Enter/Edit mandatory terms. The second option 115 allows the user to Enter/Edit
restrictions. The third option 116 allows the user to use a thesaurus. The fourth option
117 allows the user to Edit the search description. The fifth option 118 allows the user
to change the number of documents retrieved. Note that the order that the options
114-118 are presented on the screen is a design choice and that the options can be presented
in any order without departing from the spirit and scope of the invention.

The options 114-118 allow the user to modify the characteristics of the 
documents retrieved by the search request. A user can select options repeatedly to 
modify the search before instructing the system to perform the search. The first, second,
and third options 114-116 are described in more detail below. The fourth option 117
allows the user to modify the text of the search description 112a. The fifth option 118
allows the user to set the number of documents returned by the search. The number of
documents returned by the search determines the maximum number of documents that
the user will be able to view after conducting the search. The documents returned will
be the ones with the highest ratings so that, for example, a setting of twenty-five will
return to the user the twenty-five highest rated documents. Document rating is described
in more detail below. Note that it is possible that the search can return a number of
documents less than the number of return documents set using the fifth option 118
(including zero documents) if the number of documents that satisfy the search
requirements is less than the number set by the fifth option 118.
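
The behavior of the fifth option can be sketched as a simple selection of the highest rated documents. The function name and the sample ratings are assumptions; how a rating is computed is described elsewhere in the specification and is not modeled here.

```python
import heapq


def top_rated(ratings, limit):
    """Return up to `limit` document ids, highest rating first.

    `ratings` maps a document id to its rating. Fewer than `limit` ids
    (possibly none) come back when fewer documents satisfied the search.
    """
    return heapq.nlargest(limit, ratings, key=ratings.get)


ratings = {"doc-a": 0.4, "doc-b": 2.5, "doc-c": 1.1}  # hypothetical ratings
best_two = top_rated(ratings, 2)
all_matches = top_rated(ratings, 25)
no_matches = top_rated({}, 25)
```

With a setting of twenty-five and only three matching documents, all three are returned, which mirrors the note above that a search can return fewer documents than requested.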

Referring to FIG. 8, the screen 110 is shown after the user has selected the first
option 114 for entering mandatory terms. A mandatory term is a term that must be in
the document for the document to be returned on the search. Without entering one or
more mandatory terms, the search returns documents that contain at least one term of
the search description. If the user desires that all documents returned on the search
contain at least one occurrence of a particular term or terms, then he selects the first
option 114 and enters the mandatory terms at the terminal keyboard. In the screen 110
shown in FIG. 8, the user has indicated at a line 122 that the terms "TQM" and
"usability testing" are mandatory terms. Therefore, all documents returned by the search
are guaranteed to contain both the term "TQM" and the term "usability testing".
Mandatory terms are implemented using filters similar to the filters used to construct
logical document collections, discussed above.
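
The effect of mandatory terms on matching can be sketched as below. The sketch assumes that a returned document must still contain at least one term of the search description in addition to every mandatory term; the text does not settle that point, and the function name and sample documents are hypothetical.

```python
def matching_documents(doc_terms, search_terms, mandatory_terms=()):
    """Return ids of documents with at least one search term and all mandatory terms.

    `doc_terms` is a list of sets, one set of terms (words or phrases) per document.
    """
    search = set(search_terms)
    mandatory = set(mandatory_terms)
    results = []
    for doc_id, terms in enumerate(doc_terms):
        has_search_term = bool(search & terms)  # at least one search term present
        has_all_mandatory = mandatory <= terms  # acts as the mandatory-term filter
        if has_search_term and has_all_mandatory:
            results.append(doc_id)
    return results


doc_terms = [
    {"TQM", "usability testing", "software"},
    {"TQM", "software"},
    {"usability testing", "TQM"},
    {"software"},
]
unrestricted = matching_documents(doc_terms, ["software", "quality"])
restricted = matching_documents(
    doc_terms, ["software", "quality"], ["TQM", "usability testing"]
)
```

Note that the mandatory terms here are not part of the search description, which the specification permits.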

It is not necessary that a mandatory term be part of the search description. The 
user can provide mandatory terms that are not part of the search description. Mandatory
terms that are not part of the search description are not used in document ranking 
(described in detail hereinafter) but are used for other features which will be discussed 
herein. 

Referring to FIG. 9, a screen 130 is provided in response to the user selecting
the second option 115 for entering restrictions. The screen 130 shows a plurality of
restrictions 132-136 that can be selected by the user. A restriction is a constraint on
the documents returned by the search. For example, the first restriction 132 is a date
restriction that indicates that the date of the documents that are returned by the search
will be before or after a particular date. Note that the restrictions illustrated for the
screen 130 are for a particular type of logical collection (written court opinions). For
other types of logical collections and other types of documents, it may be appropriate to
provide other types of restrictions for the user to select.

Referring to FIG. 10, the screen 110 is shown after restrictions have been 
entered. A line 142 on the screen 110 shows that, for this example, the documents
returned on the search will be written opinions from the sixth circuit only and that all
documents will have a date later than 10/1/92. The SA uses the restrictions to instruct
the SR's to construct filters (as described above). The filters ensure that only
documents meeting the requirements imposed by the restrictions are returned by the
search.

Referring to FIG. 11, the screen 110 is shown with both the line 122 for 
mandatory terms and the line 142 for restrictions, thus illustrating that a user can enter 
both restrictions and mandatory terms for a search. Note that in this case, the SA will 
instruct the SR to build filters for both the mandatory terms and the restrictions.

Referring to FIG. 12, a screen 150 is provided in response to the user selecting 
the third option 116 (thesaurus) from the screen 110 shown in FIG. 7. Terms 152 from 
the search description are listed on the screen 150. The user views the terms 152 from 
the search description and selects term numbers 154 for which equivalent terms are
desired. 

Referring to FIG. 13, a screen 160 is provided to a user after the user selects the
terms of the search description for which equivalent terms are desired. A portion 162 of
the screen 160 shows the term from the search description. Another portion 164 of the
screen 160 shows morphological variations of the term. A morphological variation of a
word is a different form of the same word. For example, "condition", "conditioner",
"conditioning", and "conditioned" are all morphological variations of each other.
Another portion 166 of the screen 160 shows synonyms for the term from the search
description. The user can choose any number and combination of morphological
variations and synonyms from the screen 160. The thesaurus function is implemented
using a lookup table that contains both synonyms and morphological variations, which is
stored as shown in FIG. 1 (elements 52-54), described above.

Referring to FIG. 14, the screen 110 is shown with the mandatory term line 122, 
the restrictions line 142 and a search description 112b that contains equivalents (shown 
in parentheses) for some of the terms in the search description 112b. The equivalents
are provided by the user after he has selected the third (thesaurus) option 116 and
provided selected equivalent terms for particular terms in the search description 112b.

Referring to FIG. 15, a screen 170 is provided after completion of the search. 
Note that the mandatory term line 122 and the restriction line 142 are part of the screen 
170. A term line 172 shows terms of the search description listed in order of term 
importance. Term importance is discussed in more detail hereinafter. Note also that the
term line 172 contains an asterisk followed by two terms 173 "FIRST" and "CASE".
The asterisk indicates that the terms which follow are "frequently used terms". Any 
terms that are deemed frequently used terms are eliminated from further processing for 
the search because their value in locating the desired documents is deemed minimal. 

For the search illustrated by the screen 170, the two terms 173 "FIRST" and
"CASE" were deemed frequently used terms and hence were not searched. The
ancillary data 76, shown in connection with FIG. 2, contains a table of frequently used
terms for each physical document collection. The determination as to which terms go
into the table is based on a variety of functional factors known to one of ordinary skill
in the art, including, but not limited to, the frequency of occurrence of a term in the
physical collection and the relationship between the meaning of a term and the nature of
the document collection. Note that this mechanism is different from the "noise words",
discussed above, since noise words aren't even listed in the index for the physical
document collection. Frequently used terms are listed in the index. Furthermore, noise
words are completely eliminated from the search description and will not be seen on the
screen after completion of the search. Also note that if a frequently used term is a
word, it will not be eliminated if it is part of a phrase. Phrases are discussed in more
detail herein.
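
The distinction between noise words and frequently used terms can be sketched as a three-way split of the search description. The function name and the sample word lists are assumptions; in the system the two lists come from the ancillary data of each physical collection.

```python
def classify_terms(description_terms, noise_words, frequent_terms):
    """Split search description terms into (searched, frequently_used).

    Noise words are dropped entirely and never shown again. Frequently used
    terms are kept for display but are not searched. A phrase is one term, so
    a phrase containing a frequently used word is still searched.
    """
    searched, frequently_used = [], []
    for term in description_terms:
        if term in noise_words:
            continue  # eliminated from the search description
        if term in frequent_terms:
            frequently_used.append(term)  # shown after the asterisk, not searched
        else:
            searched.append(term)
    return searched, frequently_used


searched, frequently_used = classify_terms(
    ["defamation", "of", "first", "case", "first amendment"],
    noise_words={"of"},
    frequent_terms={"first", "case"},
)
```

The phrase "first amendment" survives even though "first" alone is a frequently used term, matching the note above about words that are part of a phrase.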

The screen 170 also contains a plurality of view options 174-177. Note that the 
order that the options 174-177 are presented on the screen is a design choice and that
the options can be presented in any order without departing from the spirit and scope of
the invention. Also, other viewing options that are not shown by one of the prompts
174-177, such as an option to cite the retrieved documents, may be available to the user
by typing in the appropriate command to execute the option. The first option 174
allows the user to view a "SuperKWIC" window of the document. The second option
175 provides a screen which shows the user the location of the search terms in the
documents. The third option 176 brings up a screen that indicates to the user why
certain documents were chosen. The fourth option 177 provides the user with an
opportunity to change the order that the retrieved documents are presented. By default,
the retrieved documents are presented in ranked order (document ranking is described in
more detail hereinafter).

By choosing the fourth option 177, the user can provide that the documents be 
presented in a different order, such as chronological order and/or from highest court to 
lowest court in the case of written court opinions. Each physical document collection 
has associated therewith, in the ancillary information data 76, predetermined information
indicating how the documents can be presented. The predetermined information is 
based on a variety of functional factors known to one of ordinary skill in the art and 
varies according to the nature of a particular document collection. 

The "SuperKWIC" window is a text window where the user can view the most
indicative portion of the document based on the occurrences and diversity of mandatory
terms and search terms. A text window is a viewable contiguous portion of a document
having a particular length, such as, for example, fifteen lines. For each text window, a
SuperKWIC score is calculated and the window having the highest score becomes the
SuperKWIC window. The score of a window is calculated as follows: For each search
term and/or mandatory term in the window, the score is incremented by one. (Note
that, as discussed above, a mandatory term is not necessarily a term of the search
description.) For each search term and/or mandatory term in the window preceded by a
different search term and/or mandatory term, the score is incremented by two. For each
search term and/or mandatory term in the window that is preceded by a different search
term and/or mandatory term which is itself preceded by yet a different search term
and/or mandatory term, the score is again incremented by two.
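
By way of illustration only, the scoring rule described above might be sketched as follows. The sketch rests on several assumptions that the description does not settle: a window is measured in words rather than lines, "preceded by" refers to the previous search or mandatory term occurring in the window (intervening ordinary words are ignored), and "yet a different" term means one that differs from both of the terms that follow it. The names `superkwic_score` and `best_window` are hypothetical.

```python
def superkwic_score(window_words, terms):
    """Score one text window; `terms` holds lower-case search/mandatory terms."""
    # Keep only the term occurrences, in order (assumption: other words are ignored).
    hits = [w.lower() for w in window_words if w.lower() in terms]
    score = 0
    for k, term in enumerate(hits):
        score += 1                                  # one point per term occurrence
        if k >= 1 and hits[k - 1] != term:
            score += 2                              # preceded by a different term
            if k >= 2 and hits[k - 2] != hits[k - 1] and hits[k - 2] != term:
                score += 2                          # which follows yet a different term
    return score


def best_window(words, terms, length):
    """Return (start index, score) of the highest scoring window of `length` words."""
    best_start, best_score = 0, -1
    for start in range(max(1, len(words) - length + 1)):
        score = superkwic_score(words[start:start + length], terms)
        if score > best_score:                      # first window wins ties
            best_start, best_score = start, score
    return best_start, best_score
```

Under these assumptions a window containing three distinct terms in succession scores higher than a window repeating one term three times, which matches the stated aim of favoring diversity of terms.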

Referring to FIG. 16, a screen 180 shows the result of choosing the third (why)
option 176 illustrated in FIG. 15. A first column 182 of the screen 180 lists the terms
of the search description. The terms are listed in order of term importance (discussed in
detail below). A second column 184 shows the number of documents that were
retrieved (returned for the search) that contain each term from the column 182. A
column 186 shows the total number of documents that were examined that contain each
term listed in the column 182. A column 188 shows the term importance of each of the
terms listed in the column 182. Calculation of term importance is discussed in more
detail below.

The terms listed in the column 182 are listed in order of term importance. At 
the top of the column 182 are mandatory terms (in this case, the term
"DEFAMATION"). Following the mandatory terms are terms in decreasing order of
importance, as shown in the column 188. At the bottom of the list are frequently used
terms, which, as discussed above, are not part of the search. Note that the frequently
used term "A" in column 182 shows in the columns 184, 186, 188.

Referring to FIG. 17, a screen 190 is provided in response to a user selecting the
second option (where) 175 shown in FIG. 15. The screen contains a column 192 that
lists all of the search terms. A row 194 lists the retrieved document numbers (in this
example the first through the twenty-fifth document). For each document that is
returned by the search, the screen 190 indicates which terms were present in which
documents by placing an asterisk in the column corresponding to a particular document
and the row corresponding to a particular term. The absence of an asterisk in a




particular row and column combination indicates that the document corresponding to the 
particular column does not contain the term corresponding to the particular row. 
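
As a rough illustration of the "where" screen, the grid of asterisks can be derived from the set of terms found in each retrieved document. The function name `where_rows` and the representation of each document as a set of terms are assumptions made for this sketch only.

```python
def where_rows(search_terms, retrieved_docs):
    """Return one (term, marks) pair per search term.

    `retrieved_docs` is a list of sets, one per retrieved document in display
    order; each mark is "*" if the document contains the term, else a blank.
    """
    rows = []
    for term in search_terms:
        marks = "".join("*" if term in doc_terms else " " for doc_terms in retrieved_docs)
        rows.append((term, marks))
    return rows
```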

Referring to FIG. 18, a flowchart 200 illustrates the high level flow of the
software. At a first step 202, the user enters a search description. Following the first
step 202 is a second step 203 for determining phrases. At the step 203, phrases in the
search description are detected. Phrases are groups of words that are most meaningful
when searched as a group rather than individually, such as the phrase "product liability".
The subsequent search is performed only on the detected phrases rather than on the
individual words that make up the phrases. Phrase detection is described in more detail
hereinafter.

After the step 203 is a test step 204 where a determination is made as to whether
the user has provided a search option (i.e., one or more restrictions, mandatory terms,
etc.) as described above. If the user has not provided a search option, control passes
from the step 204 to another test step 205 where a determination is made as to whether
the user has requested that the search be performed. For the embodiment of the
invention illustrated herein, the user indicates a desire to perform a search by pressing
the <ENTER> key without providing a search option. If the user has not requested that
a search be performed, then control passes from the step 205 back to the step 204.

If the user has provided a search option at the step 204, control passes from the
step 204 to a step 207 where the user option selection is processed, as described in
detail above. Following the step 207 is a test step 208 where a test is made to
determine if the option entered by the user is a modification of the search description or
is not the first entry of a mandatory term. If so, control passes from the step 208 back
to the step 204 in order to process more user inputs. Otherwise, control passes from the
step 208 back to the step 203 in order to detect any additional phrases entered by the
user prior to processing additional inputs at the step 204. Phrases are not detected after
editing the description or after entering mandatory terms (except the first time) because
the user may be selecting those options in order to circumvent automatic phrase
detection.

If at the step 205 the user has indicated a desire to perform the search, control
passes from the step 205 to a step 210 where the search is performed. The search
procedure is described in more detail hereinafter. Following the step 210 is a step 212
where the search results are displayed, as described above.



Referring to FIG. 19, a table 220 is used to detect phrases. The table 220
contains entries for each word 222 that could be part of a phrase, a bitmap 223
indicating possible locations in a phrase of the associated word entry 222, and an ID
224 used for compressing the representations for each of the words by assigning a
unique arbitrary number to represent each word. The table 220 is indexed by the word
entries 222.

The bitmap entries 223 for each word indicate the positions at which the word could
appear in a phrase. Each bitmap entry 223 is one byte (eight bits) long. A value of
one at a particular bit position in the bitmap entry 223 indicates that the word 222
associated with the bitmap 223 could appear in the corresponding position in a phrase.
Conversely, a value of zero in a particular position indicates that the word 222 could
not appear at the same position in a phrase. For example, if a bitmap entry 223 for a
word 222 contains a one bit in the first, third, and sixth bit positions and contains zero
bits in all of the other positions, then the associated word 222 could be part of a phrase
only if the word were the first, third, or sixth word of the phrase.
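
The one-byte bitmap can be illustrated with a short sketch. The description does not state which end of the byte holds the first bit position, so the sketch assumes that phrase position 1 maps to the least significant bit; the helper names `make_bitmap` and `can_appear_at` are hypothetical.

```python
def make_bitmap(positions):
    """Build a one-byte bitmap from 1-based phrase positions (1 through 8)."""
    bitmap = 0
    for pos in positions:
        if not 1 <= pos <= 8:
            raise ValueError("a one-byte bitmap covers positions 1 through 8 only")
        bitmap |= 1 << (pos - 1)        # assumption: position 1 is the lowest bit
    return bitmap


def can_appear_at(bitmap, pos):
    """Return True if the word may occupy 1-based position `pos` in a phrase."""
    return 1 <= pos <= 8 and bool(bitmap & (1 << (pos - 1)))
```

For the example given above, `make_bitmap([1, 3, 6])` yields a byte for which `can_appear_at` is true only at positions one, three, and six.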

Referring to FIG. 20, a diagram of a tree structure 230 shows a root node 232, a
plurality of level one nodes 234-236, a plurality of level two nodes 237-239, a plurality
of level three nodes 240-242, and a plurality of level eight nodes 243-245. The break
between level three and level eight represents a plurality of level four, level five, level
six and level seven nodes that are part of the tree structure 230 but are not shown in
FIG. 20. Each of the nodes corresponds to one of the ID's 224 from the table 220 of
FIG. 19.

Each of the level one nodes 234-236 represents the first word of a phrase. Each
of the level two nodes 237-239 represents the second word of a phrase. Generally,
each of the level N nodes represents the Nth word in a phrase. Connections between the
nodes represent the traversal of successive words in a phrase. A terminal node
represents the last word in a phrase. Note that in the embodiment illustrated herein, all
level eight nodes are terminal nodes, thus allowing the bitmap entries (described above)
to be only one byte in length. In order to save memory space, the tree structure 230 is
stored using the ID's 224 shown in FIG. 19 rather than using the actual words.

Referring to FIG. 21, a successive group of words from the user-supplied search
description can be examined for phrases. Initially, wordA is first examined by looking
up wordA in the table 220 of FIG. 19 in order to obtain the bitmap entry therefor. If
the bitmap entry for wordA indicates that wordA could never be the first word of a
phrase (by having a zero in the first bit position), then wordA is deemed to be not part
of a phrase. Otherwise, the ID entry corresponding to wordA is used to locate the level
one node in the tree structure 230 corresponding to wordA.

Next, wordB is examined. If wordA is not part of a phrase, then wordB is
examined in the same manner as wordA had been examined. Otherwise, wordB is
examined by first checking the corresponding bitmap entry to see if wordB could be the
second word of a phrase. If not, then wordA is deemed not to be part of a phrase and
wordB is examined in the same manner as the previous examination of wordA. If, on
the other hand, wordB could be the second word of a phrase, then the level one node
corresponding to wordA is examined to see if it connects to a level two node
corresponding to wordB. If not, then wordA is deemed to be not part of a phrase and
wordB is examined in the manner that wordA had been previously examined.
Otherwise, wordA and wordB are deemed to be the first two words in a phrase and
processing continues with wordC.

If wordC is found not to be part of the phrase, then the phrase consists of wordA
followed by wordB and processing continues by testing if wordC is the first word of a
new phrase. Otherwise, if wordC is found to be the third word of the phrase that
begins with wordA and wordB, then processing continues by testing if wordD could be
the fourth word of the phrase. As each complete phrase is detected, the word after the
detected phrase is examined to determine if the word after the detected phrase could be
the first word of a new phrase.
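
The interplay of the table 220 and the tree structure 230 might be sketched as shown below. This is an illustrative reading and not the disclosed implementation: the tree is modeled as nested dictionaries keyed by word ID, a reserved key `END` marks terminal nodes, bit position 1 is taken to be the least significant bit, and when a traversal stops the sketch falls back to the longest complete phrase seen so far. All function and variable names are hypothetical.

```python
END = 0  # reserved key marking a terminal node; real word IDs start at 1


def build_phrase_data(phrases):
    """Return (table, tree): table maps word -> (bitmap, id); tree is keyed by id."""
    table, tree = {}, {}
    for phrase in phrases:
        words = phrase.lower().split()
        if not 2 <= len(words) <= 8:
            raise ValueError("phrases are limited to eight words")
        node = tree
        for pos, word in enumerate(words, start=1):
            bitmap, word_id = table.get(word, (0, len(table) + 1))
            table[word] = (bitmap | 1 << (pos - 1), word_id)   # record the position
            node = node.setdefault(word_id, {})                # descend one level
        node[END] = True
    return table, tree


def detect_phrases(words, table, tree):
    """Scan `words` left to right and return the phrases that are detected."""
    words = [w.lower() for w in words]
    found, i = [], 0
    while i < len(words):
        node, j, last_end = tree, i, None
        while j < len(words):
            entry = table.get(words[j])
            pos = j - i + 1
            # Cheap bitmap test first: can this word sit at this position at all?
            if entry is None or pos > 8 or not entry[0] & (1 << (pos - 1)):
                break
            child = node.get(entry[1])      # then confirm the link in the tree
            if child is None:
                break
            node, j = child, j + 1
            if END in node:
                last_end = j                # a complete phrase ends here
        if last_end is None:
            i += 1                          # words[i] is not the start of a phrase
        else:
            found.append(" ".join(words[i:last_end]))
            i = last_end                    # resume with the word after the phrase
    return found
```

The bitmap lookup lets most words be rejected without touching the tree, which appears to be the purpose of keeping both structures.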

As discussed above in connection with FIG. 2 and FIG. 3, the indices for the
physical document collections can contain phrases stored as a single term. For phrases
not stored as a single term, it is possible to search words of the phrase separately. For
example, the phrase "personal injury" can be searched by merging a search for the term
"personal" with a search for the term "injury" and then determining where the term
"personal" precedes the term "injury" by one location in each of the documents.

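The merge of two single-term searches into a phrase search can be illustrated as follows, assuming that each index lookup yields, per document, a list of word locations. That data layout and the name `phrase_locations` are assumptions made for the sketch.

```python
def phrase_locations(first_postings, second_postings):
    """Find where the first term immediately precedes the second term.

    Each argument maps a document ID to a list of word locations.
    Returns a dict of document ID -> locations where the two-word phrase starts.
    """
    result = {}
    for doc_id in first_postings.keys() & second_postings.keys():
        second = set(second_postings[doc_id])
        starts = [p for p in first_postings[doc_id] if p + 1 in second]
        if starts:
            result[doc_id] = starts
    return result
```

For example, if "personal" occurs at locations 4 and 20 of a document and "injury" at locations 5 and 30, the phrase is found once, starting at location 4.
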
Referring to FIG. 22, a flowchart 260 illustrates the search process. Steps shown
on the left-hand side of the flowchart 260 represent steps that are executed by the SA
while steps shown on the right-hand side of the flowchart 260 represent steps executed
by one or more SR's that perform the actual search and retrieval. The dotted lines
connecting steps on the left-hand side with steps on the right-hand side indicate the
temporal relationship between steps executed by the SA and steps executed by the SR's.
The flowchart 260 does not indicate dataflow between the SA's and the SR's.

Processing begins at a first step 262 where the SA provides the search terms and 
instructions for the filters to the SR's. Note that, as discussed above, the SA forms a 
logical collection of documents, selected by the user, by determining the appropriate 
physical collections and by providing the instructions to construct the appropriate filters.
Other filter instructions can be provided in response to the user entering mandatory 
terms or restrictions. 

Following the step 262 is a step 263 where the SR's that are part of the search
each perform the low level search by accessing the indices of the physical collections
that comprise the selected logical collections as discussed above. Following the step
263, the SA receives document counts from the SR's at a step 264. Document counts
are the number of documents that contain each of the terms. Therefore, if there are a
total of N terms being searched, there will be N document count values. Also at the
step 263, the SR(s) return an indicator for any frequently used terms (described above)
that will be eliminated from further consideration.

Following the step 264 is a step 265, where for each search term, the SA
calculates dfi and maxdfi. The quantity dfi represents the number of documents that
contain term i. The quantity maxdfi represents the maximum number of documents in
which any of the terms of the search appear. Therefore, for any search, there is only a
single value for maxdfi. Note that the SA calculates dfi and maxdfi since a user's
logical document collection may comprise a plurality of physical collections at different
SR's. No single SR could calculate dfi and maxdfi without knowing the document
counts provided by the other SR's.
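
A minimal sketch of the calculation at the step 265 follows. It assumes that every document resides at exactly one SR, so that the per-SR document counts for a term can simply be added; the name `aggregate_counts` and the dictionary layout are hypothetical.

```python
def aggregate_counts(per_sr_counts):
    """Combine per-SR document counts into dfi for each term and a single maxdfi.

    `per_sr_counts` is a list with one dict per SR, mapping term -> number of
    documents at that SR containing the term.
    """
    df = {}
    for counts in per_sr_counts:
        for term, n in counts.items():
            df[term] = df.get(term, 0) + n      # assumption: no document is shared by two SR's
    maxdf = max(df.values(), default=0)         # one value for the whole search
    return df, maxdf
```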


Following the step 265 is an optimization step 266 where any SR's which do not
contain a single document that matches any of the search terms are eliminated from
further consideration. That is, an SR, which is part of a physical document collection
that comprises the user's selected logical collection, is eliminated from further
processing if the SR does not contain any relevant documents.

Following the step 266 is a step 267 where dfi and maxdfi, calculated by the SA
at the step 265, are provided to the SR's. Following the step 267 is a step 268 where
each of the SR's calculates the rank of each document that was located at the search
step 263. The rank of each document is determined by the following summation
formula:

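$$\text{rank} = \sum_{i=1}^{nt} \left(\frac{ut}{nt}\right) \left(\frac{\left(\log(tf_i + 1)\right)^2}{\log(maxtf_i + 1)}\right) \left(\log\frac{1 + maxdf_i}{df_i}\right) \left(\log(oc)\right)$$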



where nt is the total number of search terms, ut is the number of those terms that occur
in the document being ranked (each distinct term is counted once), and tfi is the number
of times term i occurs in the document being ranked. The quantity oc represents the
percentage of occurrences of search terms in a document floating text window
containing the maximum number of search terms and is calculated by dividing the
number of occurrences of search terms in the window by the total number of
occurrences of search terms in the document and then multiplying the result by one
hundred. The quantities dfi and maxdfi are discussed above. All logs are in base two.
Note that the quantity log(tfi+1) is squared in the formula, thus weighting the ranking in
favor of documents having the greatest term frequency.
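
The ranking computation might be sketched as below. The sketch follows the quantities as defined in the text; the quantity maxtfi appears in the formula but is not defined in this passage, so it is treated here as a value supplied by the caller for each term. The function name `document_rank` and the dictionary arguments are hypothetical.

```python
import math


def document_rank(tf, maxtf, df, maxdf, oc):
    """Rank one document.

    tf, maxtf, df: dicts keyed by search term (tfi, maxtfi, dfi).
    maxdf: the single maxdfi value for the search.
    oc: percentage of term occurrences falling in the densest window.
    """
    nt = len(tf)                                        # total number of search terms
    ut = sum(1 for count in tf.values() if count > 0)   # distinct terms in this document
    rank = 0.0
    for term, count in tf.items():
        if count == 0:
            continue                                    # log2(0 + 1) squared is zero
        tf_part = math.log2(count + 1) ** 2 / math.log2(maxtf[term] + 1)
        idf_part = math.log2((1 + maxdf) / df[term])
        rank += (ut / nt) * tf_part * idf_part * math.log2(oc)
    return rank
```

Because the term frequency factor is squared, doubling the number of occurrences of a term raises the rank more than the inverse document frequency factor alone would.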

At the step 268, each of the SR's returns to the SA the ranking for the top 
ranked N documents where N is the number of documents that the user requested to be 
returned on the search. The number of retrieved documents is set by the user in the 
manner described in detail above. 

Following the step 268, the SA executes the step 269 where all of the document
rankings returned by the SR's are merged and sorted by rank. All but the documents
having the top N ranks, where N is the number of documents requested by the user, are
eliminated from further consideration. The SA then requests the remaining documents
from the appropriate SR(s). For example, if the user requests twenty-five documents,
and if the logical document collection spans three SR's, then each of the three SR's
would return twenty-five document rankings. The SA would merge the three sets of
twenty-five document rankings each, sort them by rank, and then eliminate from
consideration all but the top twenty-five documents. The SA would then request that
the appropriate SR(s) return the top ranked documents.
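
The merge performed by the SA at the step 269 can be illustrated with the following sketch, which uses the standard `heapq.nlargest` function. The names `merge_rankings`, the SR labels, and the tuple layout are assumptions; how ties between equal ranks are resolved is not addressed in the text and is left to the library here.

```python
import heapq


def merge_rankings(per_sr_rankings, n):
    """Merge per-SR rankings and keep the top n documents overall.

    `per_sr_rankings` maps an SR label to a list of (rank, doc_id) pairs.
    Returns (top, requests): `top` is a list of (rank, sr, doc_id) in
    descending rank order; `requests` maps each SR to the doc IDs to fetch.
    """
    merged = [(rank, sr, doc_id)
              for sr, ranking in per_sr_rankings.items()
              for rank, doc_id in ranking]
    top = heapq.nlargest(n, merged, key=lambda item: item[0])
    requests = {}
    for rank, sr, doc_id in top:
        requests.setdefault(sr, []).append(doc_id)
    return top, requests
```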

Following the step 269 is a step 270, executed by the SR, where the SR(s)
retrieve and return to the SA the top ranked documents requested by the SA at the step
269.

Following the step 270 is a step 271, executed by the SA, where the term
importance for each of the terms is calculated. The term importance, which is
discussed above in connection with the column 188 of FIG. 16, equals log2(maxdfi/dfi),
normalized so that the values of term importance range from one to one hundred.
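
A sketch of the term importance calculation follows. The text does not say how the normalization to the range of one to one hundred is performed; the sketch assumes a linear rescaling between the smallest and largest raw values, which is only one possible reading. The name `term_importance` is hypothetical.

```python
import math


def term_importance(df):
    """Map each term to an importance between 1 and 100 from its dfi value."""
    maxdf = max(df.values())
    raw = {term: math.log2(maxdf / n) for term, n in df.items()}
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {term: 100.0 for term in raw}    # all terms equally important
    # Assumed normalization: linear rescale of the raw values onto [1, 100].
    return {term: 1 + 99 * (value - lo) / (hi - lo) for term, value in raw.items()}
```

Under this assumption the term appearing in the most documents receives an importance of one and the rarest term receives one hundred.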

Following the step 271 is a step 272 where the documents are displayed, as 
discussed in detail above. 

Claims

1. A system for searching and retrieving documents, comprising:
a plurality of documents stored as character data; 

an index containing entries correlating terms in the documents with locations 
thereof in the documents; and 

a processor, coupled to said documents and said index, wherein said processor 
ranks said documents according to the square of the log of the frequency of occurrence 
of search terms contained in the documents. 

2. An associative text search and retrieval system, comprising: 

front end processing means for interacting with a network having one or more 
user terminals connected thereto to allow a user to provide information to the system 
and receive information from the system; 

storage means for storing a plurality of text documents; and 
processor means, coupled to the front end processing means and the storage 
means, for performing a search of the text documents using a plurality of search terms 
provided by the user and for providing to the front end processing means a 
predetermined number of retrieved documents that are a subset of the text documents 
and that contain at least one of the search terms, the retrieved documents having higher 
ranks than text documents not provided to the front end processing means, wherein the 
ranks are calculated using a formula that varies according to the square of the frequency 
in each of the text documents of each of the search terms. 

3. An associative text search and retrieval system, according to claim 2, wherein 
the formula also varies according to an inverse document frequency of each of the 
search terms.

4. An associative text search and retrieval system, according to claim 3, wherein 
the formula is: 

$$\text{rank} = \sum_{i=1}^{nt} \left(\frac{ut}{nt}\right) \left(\frac{\left(\log(tf_i + 1)\right)^2}{\log(maxtf_i + 1)}\right) \left(\log\frac{1 + maxdf_i}{df_i}\right) \left(\log(oc)\right)$$


wherein nt represents a total number of search terms, ut represents a number of unique
search terms that occur in a particular one of the text documents, tfi represents a
number of times search term i occurs in the text document, oc represents a percentage
of occurrences of search terms in a floating text window containing a maximum number
of search terms and is calculated by dividing a count of occurrences of search terms in
the window by a total number of occurrences of search terms in the document and then
multiplying the result by one hundred, dfi is a count of the text documents that contain
term i, maxdfi is a maximum number of the text documents in which any of the search
terms appear, and all logs are in base two.

5. An associative text search and retrieval system, comprising: 

front end processing means for interacting with a network having one or more
user terminals connected thereto to allow a user to provide information to the system
and receive information from the system;

storage means for storing a plurality of text documents;

processor means, coupled to the front end processing means and the storage
means, for performing a search of the text documents using a plurality of search terms
provided by the user and for providing to the front end processing means a
predetermined number of retrieved documents that are a subset of the text documents
and that contain at least one of the search terms; and

a table used to detect phrases, the table containing entries which, for each word
that can be part of a phrase, indicate a position that the word can occupy in any phrase.



6. An associative text search and retrieval system, comprising: 

front end processing means for interacting with a network having one or more 
user terminals connected thereto to allow a user to provide information to the system 
and receive information from the system; 

storage means for storing a plurality of text documents; 

processor means, coupled to the front end processing means and the storage 
means, for performing a search of the text documents using a plurality of search terms 
provided by the user and for providing to the front end processing means a 
predetermined number of retrieved documents that are a subset of the text documents 
and that contain at least one of the search terms; and 

a thesaurus having entries for a plurality of words which correlate each word 
with both synonyms and morphological variations. 

7. An associative text search and retrieval system, comprising: 

front end processing means for interacting with a network having one or more 
user terminals connected thereto to allow a user to provide information to the system 
and receive information from the system; 

storage means for storing a plurality of text documents; 

processor means, coupled to the front end processing means and the storage 
means, for performing a search of the text documents using a plurality of search terms 
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provided by the user and for providing to the front end processing means a 
predetermined number of retrieved documents that are a subset of the text documents 
and that contain at least one of the search terms; and 

means for allowing the user to enter mandatory terms which must be present in 
each of the retrieved documents. 

8. An associative text search and retrieval system, comprising: 

front end processing means for interacting with a network having one or more 
user terminals connected thereto to allow a user to provide information to the system 
and receive information from the system; 

storage means for storing a plurality of text documents; 

processor means, coupled to the front end processing means and the storage
means, for performing a search of the text documents using a plurality of search terms 
provided by the user and for providing to the front end processing means a 
predetermined number of retrieved documents that are a subset of the text documents 
and that contain at least one of the search terms; 

an index, associated with the text documents, for indicating the locations of 
potential search terms within the text documents; 

means for excluding noise terms from being considered for the search by not 
including noise terms in the index; and 

means for excluding frequently used terms from being considered for the search, 
the frequently used terms being contained in the index and maintained in a list of 
frequently used terms, the frequently used terms being excluded from the search by not
using terms in the list for the search.

9. An associative text search and retrieval system, comprising:

front end processing means for interacting with a network having one or more 
user terminals connected thereto to allow a user to provide information to the system 
and receive information from the system; 

storage means for storing a plurality of text documents; 

processor means, coupled to the front end processing means and the storage 
means, for performing a search of the text documents using a plurality of search terms 
provided by the user and for providing to the front end processing means a 
predetermined number of retrieved documents that are a subset of the text documents 
and that contain at least one of the search terms; and 

means for providing the user with a screen indicating occurrence or non- 
occurrence of each of the search terms within the retrieved documents. 

10. An associative text search and retrieval system, comprising:

front end processing means for interacting with a network having one or more 
user terminals connected thereto to allow a user to provide information to the system 
and receive information from the system; 

storage means for storing a plurality of text documents; 

processor means, coupled to the front end processing means and the storage 
means, for performing a search of the text documents using a plurality of search terms 
provided by the user and for providing to the front end processing means a 
predetermined number of retrieved documents that are a subset of the text documents
and that contain at least one of the search terms; and 

means for providing the user with a screen indicating a term importance for each 
of the search terms wherein the term importance varies according to inverse document 
frequency of the search term. 

11. An associative text search and retrieval system, according to claim 10, wherein 
the term importance varies according to log(maxdfi/dfi), wherein the log is to the base
two, dfi is a count of the retrieved documents that contain search term i, and maxdfi is
a maximum number of the retrieved documents in which any of the search terms
appear. 

12. An associative text search and retrieval system, comprising: 

front end processing means for interacting with a network having one or more 
user terminals connected thereto to allow a user to provide information to the system 
and receive information from the system; 

storage means for storing a plurality of text documents; 

processor means, coupled to the front end processing means and the storage 
means, for performing a search of the text documents using a plurality of search terms 
provided by the user and for providing to the front end processing means a 
predetermined number of retrieved documents that are a subset of the text documents 
and that contain at least one of the search terms; and 

means for allowing the user to select one of many possible orders for presenting 
the retrieved documents.

13. An associative text search and retrieval system, comprising: 

front end processing means for interacting with a network having one or more 
user terminals connected thereto to allow a user to provide information to the system 
and receive information from the system; 

storage means for storing a plurality of text documents; 

processor means, coupled to the front end processing means and the storage 
means, for performing a search of the text documents using a plurality of search terms 
provided by the user and for providing to the front end processing means a 
predetermined number of retrieved documents that are a subset of the text documents 
and that contain at least one of the search terms; and 

means for displaying a window of text of one of the retrieved documents, the 
window having a highest score of all possible windows of the retrieved document, the 
score varying according to the number of search terms in the window and the number 
of search terms in the window preceded by a different search term in the window.

14. An associative text search and retrieval system, according to claim 13, wherein
the score is calculated by adding one to the score for the number of search terms in the 
window, adding two to the score for each search term in the window that is preceded by 
a different search term, and by adding two to the score for each search term in the 
window that is preceded by a different search term that is also preceded by a different 
search term. 




15. An associative text search and retrieval system, according to claim 2, wherein 
said processing means comprises: 

at least one Session Administrator (SA) computer coupled to the front end 
processing means; and 

at least one Search and Retrieval (SR) computer coupled to the SA computer and 
to the document storage means, 

wherein the SR computer performs the search on the document storage means 
and returns the retrieved documents to the SA computer and wherein the SA computer 
prompts the user to enter search terms and search options, provides the SR computer 
with the search request, and allows the user to view the retrieved documents returned to 
the SA computer by the SR computer.

16. An associative text search and retrieval system, according to claim 15, wherein 
the search request is provided by the SA computer to more than one SR computer, the 
SR computers calculate document ranks for text documents found in the course of the 
search and return the document ranks to the SA computer, and the SA computer merges 
the ranks and requests the SR computers to return a subset of the text documents having 
the highest overall ranks. 

17. An associative text search and retrieval system, according to claim 2, further 
comprising: 

a table used to detect phrases, the table containing entries which, for each word 
that can be part of a phrase, indicate a position that the word can occupy in any phrase. 

18. An associative text search and retrieval system, according to claim 2, further
comprising: 

a thesaurus having entries for a plurality of words which correlate each word 
with both synonyms and morphological variations. 

19. An associative text search and retrieval system, according to claim 2, further 
comprising: 

means for allowing the user to enter mandatory terms which must be present in 
each of the retrieved documents. 

20. An associative text search and retrieval system, according to claim 2, further 
comprising: 

an index, associated with the text documents, for indicating the locations of 
potential search terms within the text documents; 

means for excluding noise terms from being considered for the search by not 
including noise terms in the index; and 

means for excluding frequently used terms from being considered for the search, 
the frequently used terms being contained in the index and maintained in a list of 
frequently used terms, the frequently used terms being excluded from the search by not 
using terms in the list for the search.

21. An associative text search and retrieval system, according to claim 4, wherein
the processor means comprises: 

at least one Session Administrator (SA) computer coupled to the front end
processing means; and

at least one Search and Retrieval (SR) computer coupled to the SA computer and
to the document storage means,

wherein the SR computer performs the search on the document storage means
and returns the retrieved documents to the SA computer and wherein the SA computer
prompts the user to enter search terms and search options, provides the SR computer 
with the search request, and allows the user to view the retrieved documents returned to 
the SA computer by the SR computer. 

22. An associative text search and retrieval system, according to claim 21, wherein
the search request is provided by the SA computer to more than one SR computer, the
SR computers calculate document ranks for text documents found in the course of the
search and return the document ranks to the SA computer, and the SA computer merges
the ranks and requests the SR computers to return a subset of the text documents having
the highest overall ranks.

23. An associative text search and retrieval system, according to claim 17, further 
comprising: 

a thesaurus having entries for a plurality of words which correlate each word 
with both synonyms and morphological variations. 

24. An associative text search and retrieval system, according to claim 23, further 
comprising:

means for allowing the user to enter mandatory terms which must be present in
each of the retrieved documents. 

25. An associative text search and retrieval system, according to claim 24, further 
comprising: 

an index, associated with the text documents, for indicating the locations of 
potential search terms within the text documents; 

means for excluding noise terms from being considered for the search by not 
including noise terms in the index; and 

means for excluding frequently used terms from being considered for the search, 
the frequently used terms being contained in the index and maintained in a list of 
frequently used terms, the frequently used terms being excluded from the search by not 
using terms in the list for the search. 
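
The two exclusion mechanisms of claim 25 differ in where they act: noise terms never enter the index, while frequently used terms are indexed but filtered out at search time. A minimal Python sketch follows, assuming whitespace tokenization and small hypothetical word lists (`NOISE_TERMS`, `FREQUENT_TERMS`) that are not taken from the claims.

```python
from collections import defaultdict

NOISE_TERMS = {"the", "of", "and", "a"}   # hypothetical noise-term list
FREQUENT_TERMS = {"court"}                # hypothetical frequently-used list

def build_index(docs):
    """Map each non-noise word to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in NOISE_TERMS:   # noise terms are never indexed
                index[word].append((doc_id, pos))
    return index

def searchable_terms(query_terms, index):
    """Drop terms absent from the index and terms on the frequently-used list."""
    return [t for t in query_terms if t in index and t not in FREQUENT_TERMS]

docs = {1: "the court of appeal", 2: "appeal and the court"}
index = build_index(docs)
terms = searchable_terms(["the", "court", "appeal"], index)
```

Here "court" remains in the index, so its locations are still available, but it is not used for the search.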

26. An associative text search and retrieval system, according to claim 25, further 
comprising: 

means for providing the user with a screen indicating occurrence or non- 
occurrence of each of the search terms within the retrieved documents. 

27. An associative text search and retrieval system, according to claim 26, further
comprising: 

means for providing the user with a screen indicating a term importance for each 
of the search terms wherein the term importance varies according to inverse document
frequency of the search term. 

28. An associative text search and retrieval system, according to claim 27, wherein
the term importance varies according to log(maxdfi/dfi), wherein the log is to the base
two, dfi is a count of the retrieved documents that contain search term i, and maxdfi is 
a maximum number of the retrieved documents in which any of the search terms 
appear. 
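
The term importance of claim 28 can be worked through in a few lines of Python. The function name `term_importance` and the sample counts are illustrative assumptions; only the expression log(maxdfi/dfi) with a base-two log comes from the claim.

```python
import math

def term_importance(df):
    """Compute log2(maxdf / df_i) for each search term.

    df: {term: number of retrieved documents containing the term}.
    Terms with a zero count are skipped to avoid division by zero.
    """
    maxdf = max(df.values())
    return {term: math.log2(maxdf / n) for term, n in df.items() if n > 0}

# Hypothetical document frequencies among the retrieved documents.
importance = term_importance({"patent": 8, "retrieval": 2})
```

The most widespread term receives an importance of zero, and a term found in one quarter as many documents receives an importance of two.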

29. An associative text search and retrieval system, according to claim 28, further
comprising: 

means for allowing the user to select one of many possible orders for presenting 
the retrieved documents.

30. An associative text search and retrieval system, according to claim 29, further
comprising: 

means for displaying a window of text of one of the retrieved documents, the 
window having a highest score of all possible windows of the retrieved document, the 
score varying according to the number of search terms in the window and the number 
of search terms in the window preceded by a different search term in the window. 

31. An associative text search and retrieval system, according to claim 30, wherein 
the score is calculated by adding one to the score for the number of search terms in the 
window, adding two to the score for each search term in the window that is preceded by
a different search term, and by adding two to the score for each search term in the 
window that is preceded by a different search term that is also preceded by a different 
search term. 
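
The window scoring of claims 30 and 31 can be sketched in Python as follows. Two points are assumptions rather than statements of the claims: "preceded by" is read as referring to the previous search-term occurrence in the window (not necessarily the adjacent word), and the window is a fixed number of words of size `size`. The names `window_score` and `best_window` are hypothetical.

```python
def window_score(hits):
    """Score a window given its search-term occurrences in order."""
    score = 0
    for k, term in enumerate(hits):
        score += 1                               # one per search term present
        if k >= 1 and hits[k - 1] != term:
            score += 2                           # preceded by a different term
            if k >= 2 and hits[k - 2] != hits[k - 1]:
                score += 2                       # that term is also preceded by a different term
    return score

def best_window(tokens, terms, size):
    """Return (start, score) of the highest-scoring window of `size` words."""
    best_start, best = 0, -1
    for start in range(max(1, len(tokens) - size + 1)):
        hits = [t for t in tokens[start:start + size] if t in terms]
        s = window_score(hits)
        if s > best:                             # earliest window wins ties
            best_start, best = start, s
    return best_start, best

tokens = "the cat sat on the mat near the dog".split()
start, score = best_window(tokens, {"cat", "dog", "mat"}, 4)
```

Under this reading, a window that mixes different search terms scores higher than one that repeats a single term the same number of times.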

32. An associative text search and retrieval system, according to claim 31, wherein 
the processor means comprises: 

at least one Session Administrator (SA) computer coupled to the front end 
processing means; and 

at least one Search and Retrieval (SR) computer coupled to the SA computer and
to the document storage means, 

wherein the SR computer performs the search on the document storage means 
and returns the retrieved documents to the SA computer and wherein the SA computer 
prompts the user to enter search terms and search options, provides the SR computer 
with the search request, and allows the user to view the retrieved documents returned to 
the SA computer by the SR computer. 

33. An associative text search and retrieval system, according to claim 32, wherein 
the search request is provided by the SA computer to more than one SR computer, the 
SR computers calculate document ranks for text documents found in the course of the
search and return the document ranks to the SA computer, and the SA computer merges 
the ranks and requests the SR computers to return a subset of the text documents having 
the highest overall ranks. 

34. An associative text search and retrieval system, comprising:

a front end processor connected to a network having one or more user terminals
connected thereto to allow a user to provide information to the system and receive 
information from the system; 

a session administrator (SA) computer, connected to the front end processor, 
containing a software program that prompts the user to provide input to the system, 
formulates a search request based on input provided by the user, and provides the user 
with retrieved text documents; and 

a search and retrieval (SR) computer, coupled to the SA computer, having 
storage for storing a plurality of text documents, and having a software program for 
performing a search of the text documents using a plurality of search terms provided by
the user and for providing to the SA computer a predetermined number of retrieved 
documents containing at least one of the search terms, the retrieved documents having 
higher ranks than text documents not provided to the SA computer, wherein the ranks 
are calculated using a formula that varies according to the square of the frequency in 
each of the text documents of each of the search terms.

35. An associative text search and retrieval system, according to claim 34, wherein 
the formula also varies according to an inverse document frequency of each of the 
search terms. 

36. An associative text search and retrieval system, according to claim 35, wherein 
the formula is: 

$$\sum_{i}\frac{ut}{nt}\left(\frac{\log(tf_i+1)}{\log(maxtf_i+1)}\right)^{2}\left(\log\frac{1+maxdf_i}{df_i}\right)\left(\log(oc)\right)$$

wherein nt represents a total number of search terms, ut represents a number of unique 
search terms that occur in a particular one of the text documents, tfi represents a 
number of times search term i occurs in the text document, oc represents a percentage 
of occurrences of search terms in a floating text window containing a maximum number 
of search terms and is calculated by dividing a count of occurrences of search terms in 
the window by a total number of occurrences of search terms in the document and then 
multiplying the result by one hundred, dfi is a count of the text documents that contain 
term i, maxdfi is a maximum number of the text documents in which any of the search 
terms appear, and all logs are in base two.
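
The rank formula of claim 36 is only partly legible in this copy, so the Python sketch below is one possible reading of it and not a definitive statement of the claimed formula. It assumes that the rank is a sum over the search terms found in the document, that the factor ut/nt multiplies each summand, that the squared factor is the ratio log(tfi+1)/log(maxtfi+1), and that maxtfi (which the claim does not define) is the largest count of term i in any single document. The function and parameter names are hypothetical.

```python
import math

def document_rank(tf, maxtf, df, maxdf, nt, oc):
    """One reading of the claim 36 rank formula (all logs base two).

    tf:    {term: occurrences of the term in this document}
    maxtf: {term: assumed largest per-document count of the term}
    df:    {term: number of documents containing the term}
    maxdf: largest number of documents containing any search term
    nt:    total number of search terms
    oc:    percentage of term occurrences inside the best text window
    """
    ut = sum(1 for n in tf.values() if n > 0)   # unique terms in the document
    total = 0.0
    for term, n in tf.items():
        if n == 0:
            continue                            # absent terms contribute nothing
        tf_part = (math.log2(n + 1) / math.log2(maxtf[term] + 1)) ** 2
        idf_part = math.log2((1 + maxdf) / df[term])
        total += (ut / nt) * tf_part * idf_part * math.log2(oc)
    return total

rank = document_rank(tf={"a": 3}, maxtf={"a": 3}, df={"a": 1},
                     maxdf=3, nt=2, oc=64)
```

In this example one of two search terms is present (ut/nt = 0.5), the frequency ratio is 1, the inverse document frequency factor is log2(4) = 2, and log2(64) = 6, giving a rank of 6.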

37. An associative text search and retrieval system, according to claim 34, further 
comprising: 

a table used to detect phrases, the table containing entries which, for each word 
that can be part of a phrase, indicate a position that the word can occupy in any phrase. 
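
One way to picture the phrase table of claim 37 is the Python sketch below. It assumes that the table maps each word to the set of zero-based positions it can occupy in any known phrase, that the table is used as a quick filter to find candidate word runs, and that candidates are then confirmed against a phrase list. The phrase list, the confirmation step and the function names are illustrative assumptions, not details from the claim.

```python
PHRASES = ["summary judgment", "res ipsa loquitur"]   # hypothetical phrase list

def build_position_table(phrases):
    """Map each word to the positions it can occupy in any phrase."""
    table = {}
    for phrase in phrases:
        for pos, word in enumerate(phrase.split()):
            table.setdefault(word, set()).add(pos)
    return table

def detect_phrases(tokens, phrases=PHRASES):
    """Return the multi-word phrases found in tokens, in order of occurrence."""
    table = build_position_table(phrases)
    known = set(phrases)
    found, i = [], 0
    while i < len(tokens):
        # Extend the run while each word may occupy its offset in some phrase.
        j = i
        while j < len(tokens) and (j - i) in table.get(tokens[j], ()):
            j += 1
        # Confirm the longest candidate of two or more words first.
        for end in range(j, i + 1, -1):
            candidate = " ".join(tokens[i:end])
            if candidate in known:
                found.append(candidate)
                i = end
                break
        else:
            i += 1
    return found

found = detect_phrases("motion for summary judgment under res ipsa loquitur".split())
```

Most words fail the position test at offset zero, so the scan only does a full phrase comparison where a run of words could plausibly begin a phrase.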

38. An associative text search and retrieval system, according to claim 34, further 
comprising: 

a thesaurus having entries for a plurality of words which correlate each word 
with both synonyms and morphological variations.

39. An associative text search and retrieval system, according to claim 36, further 
comprising: 

a table used to detect phrases, the table containing entries which, for each word 
that can be part of a phrase, indicate a position that the word can occupy in any phrase. 

40. An associative text search and retrieval system, according to claim 36, further 
comprising: 

a thesaurus having entries for a plurality of words which correlate each word 
with both synonyms and morphological variations.

41. A method of operating an associative text search and retrieval system, 
comprising the steps of: 

performing a search of text documents using a plurality of search terms provided 
by a user; and 

providing the user with a predetermined number of retrieved documents that are 
a subset of the text documents and that contain at least one of the search terms, the 
retrieved documents having higher ranks than text documents not provided, wherein the 
ranks are calculated using a formula that varies according to the square of the frequency 
in each of the text documents of each of the search terms. 



42. A method of operating an associative text search and retrieval system, according 
to claim 41, wherein the formula also varies according to an inverse document
frequency of each of the search terms. 

43. A method of operating an associative text search and retrieval system, according 
to claim 42, wherein the formula is: 

$$\sum_{i}\frac{ut}{nt}\left(\frac{\log(tf_i+1)}{\log(maxtf_i+1)}\right)^{2}\left(\log\frac{1+maxdf_i}{df_i}\right)\left(\log(oc)\right)$$


wherein nt represents a total number of search terms, ut represents a number of unique 
search terms that occur in a particular one of the text documents, tfi represents a 
number of times search term i occurs in the text document, oc represents a percentage 
of occurrences of search terms in a floating text window containing a maximum number 
of search terms and is calculated by dividing a count of occurrences of search terms in 
the window by a total number of occurrences of search terms in the document and then 
multiplying the result by one hundred, dfi is a count of the text documents that contain 
term i, maxdfi is a maximum number of the text documents in which any of the search 
terms appear, and all logs are in base two.

44. A method of operating an associative text search and retrieval system, 
comprising the steps of: 

performing a search of text documents using a plurality of search terms provided 
by a user; 

using a table to detect phrases, the table containing entries which, for each word 
that can be part of a phrase, indicate a position that the word can occupy in any phrase;
and 

providing the user with a predetermined number of retrieved documents that are 
a subset of the text documents and that contain at least one of the search terms.

45. A method of operating an associative text search and retrieval system,
comprising the steps of: 

performing a search of text documents using a plurality of search terms provided
by a user;

using a thesaurus having entries for a plurality of words which correlate each 
word with both synonyms and morphological variations; and

providing the user with a predetermined number of retrieved documents that are 
a subset of the text documents and that contain at least one of the search terms.

46. A method of operating an associative text search and retrieval system, 
comprising the steps of: 

performing a search of text documents using a plurality of search terms provided 
by a user; 

allowing the user to enter mandatory terms which must be present in each of the 
retrieved documents; and 

providing the user with a predetermined number of retrieved documents that are 
a subset of the text documents and that contain at least one of the search terms. 

47. A method of operating an associative text search and retrieval system,
comprising the steps of: 

performing a search of text documents using a plurality of search terms provided
by a user;

indicating locations of potential search terms within the text documents using an 
index which is associated with the text documents; 

excluding noise terms from being considered for the search by not including 
noise terms in the index; 

excluding from being considered for the search frequently used terms contained 
in the index and maintained in a list of frequently used terms, the frequently used terms 
being excluded from the search by not using terms in the list for the search; and 

providing the user with a predetermined number of retrieved documents that are 
a subset of the text documents and that contain at least one of the search terms. 

48. A method of operating an associative text search and retrieval system, 
comprising the steps of: 

performing a search of text documents using a plurality of search terms provided 
by a user; 

providing the user with a predetermined number of retrieved documents that are
a subset of the text documents and that contain at least one of the search terms; and 

indicating occurrence or non-occurrence of each of the search terms within the 
retrieved documents. 

49. A method of operating an associative text search and retrieval system, according
to claim 48, wherein the occurrence or non-occurrence of each of the search terms within
the retrieved documents is displayed in eye-readable form. 

50. A method of operating an associative text search and retrieval system, 
comprising the steps of: 

performing a search of text documents using a plurality of search terms provided
by a user;

providing the user with a predetermined number of retrieved documents that are 
a subset of the text documents and that contain at least one of the search terms; and

indicating a term importance for each of the search terms wherein the term
importance varies according to inverse document frequency of the search term.

51. A method of operating an associative text search and retrieval system, according 
to claim 50, wherein the term importance is displayed in eye-readable form.

52. A method of operating an associative text search and retrieval system, according 
to claim 51, wherein the term importance varies according to log(maxdfi/dfi), wherein
the log is to the base two, dfi is a count of the retrieved documents that contain search
term i, and maxdfi is a maximum number of the retrieved documents in which any of 
the search terms appear. 

53. A method of operating an associative text search and retrieval system,
comprising the steps of:

performing a search of text documents using a plurality of search terms provided
by a user;

providing the user with a predetermined number of retrieved documents that are 
a subset of the text documents and that contain at least one of the search terms; and 

allowing the user to select one of many possible orders for presenting the 
retrieved documents. 

54. A method of operating an associative text search and retrieval system,
comprising the steps of: 

performing a search of text documents using a plurality of search terms provided 
by a user; 

providing the user with a predetermined number of retrieved documents that are 
a subset of the text documents and that contain at least one of the search terms; and

displaying a window of text of one of the retrieved documents, the window
having a highest score of all possible windows of the retrieved document, the score
varying according to the number of search terms in the window and the number of 
search terms in the window preceded by a different search term in the window. 

55. A method of operating an associative text search and retrieval system, according 
to claim 54, wherein the score is calculated by adding one to the score for the number 
of search terms in the window, adding two to the score for each search term in the 
window that is preceded by a different search term, and by adding two to the score for 
each search term in the window that is preceded by a different search term that is also
preceded by a different search term. 